| I confirm that I have fully read and understood the assignment brief for this module. | Y |
| Name: Chung Yan | Surname: Yu |
| Submission Date | 19-10-25 |
| Word Count: whole assignment including codes | |
| Word Count: main body excluding abstract, references and supplementary materials |
This assignment is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text. It is not substantially the same as any that I have previously submitted for a degree or diploma or other qualification at the University of Cambridge or any other university or similar institution, or that is being concurrently submitted, except as declared in the Preface and specified in the text. I further state that no substantial part of my Portfolio has already been submitted, nor is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other university or similar institution except as declared in the Preface and specified in the text.
| I confirm the statement of originality as above | Y |
Self-assessment is an important aspect of feedback literacy, which is, in turn, key to the development of expertise. As you proceed through the MSt Healthcare data science programme, we hope that you will make use of the following prompts to assess your own work on assignments. Specific assignment briefs will likely indicate which of these to address for which assessments, but, in general, we expect you to respond to one or two for each assignment on your course.
For each of the questions, do not spend too long answering – keep it brief. For each question you answer, limit yourself to no more than three items. And please remember, this is optional and developmental: these cover sheets are designed to create space for self-assessment and feedback dialogue, rather than additional assignment workload.
| Which permitted use of generative AI are you acknowledging? | Semantic search of literature and notes, outline creation, code debugging, output formatting and generation, informed feedback |
| Which generative AI tool did you use (name and version)? | Claude Code v2.0.3x (the version likely changed over usage), Claude Desktop v1.0.3x with PubMed Connector, M365 CoPilot |
| What did you use the tool for? | Searching for relevant journal publications, sanity check on coding strategies, collating notes in my Obsidian vault, debugging code by parsing error messages |
| How have you used or changed the generative AI’s output? | My collated notes always stay in point form so that I write out the paragraphs in my own words. Feedback provided by the GenAI models is weighed and assessed before being acted on. Generated code is checked: for example, a model once tried to run Levene’s test on a model fitted with residuals as the response, and this was corrected to use the relevant response variable. |
Diffuse large B cell lymphoma (DLBCL) is the most common non-Hodgkin lymphoma, with prognosis influenced by diverse factors ranging from demographic features (age, gender) to integrated scoring systems such as the International Prognostic Index (IPI) and genetic mutations that alter gene expression levels. The IPI was later revised as the National Comprehensive Cancer Network International Prognostic Index (NCCN-IPI), in which age changed from a binary category (cut-off at 60) to a quaternary category (≤40, 41-60, 61-75, >75). The oncogenes MYC, BCL2 and BCL6 are strongly associated with DLBCL, and further studies have identified over 150 genetic drivers of the disease. Yet how demographic factors interact with this expanding catalogue of genetic markers, and with the markers’ power to predict clinical outcomes, remains understudied. This study analysed clinical and genomic data from 1001 DLBCL patients to investigate whether age and gender modify the prognostic associations between gene expression markers and complete therapy response. Age was found to have a significant effect on the expression levels of 10 genes and on the prognostic power of 1 gene for therapy response. The 10 genes whose expression levels differed across NCCN-IPI age groups were BCL2 (p = 0.0002), BLNK (p = 0.003), MYBL1 (p = 0.0096), SH3BP5 (p = 0.0156), ITPKB (p = 0.0203), CCND2 (p = 0.0203), PTPN1 (p = 0.0314), BMF (p = 0.0339), FUT8 (p = 0.0367) and LMO2 (p = 0.0377) (all p-values are FDR-adjusted). CCND2 also showed a significant interaction with age in predicting complete response (likelihood-ratio test, FDR-adjusted p-value = 0.005523): complete response probability increased with CCND2 expression in the older age groups (61-75, >75) but decreased with it in the younger age groups (≤40, 41-60). This highlights the need to consider the prognostic value of genomic drivers and age in concert rather than individually.
Diffuse large B cell lymphoma (DLBCL) is the most common type of aggressive non-Hodgkin’s lymphoma (NHL) (The Non-Hodgkin’s Lymphoma Classification Project 1997; Wang 2023), with a wide range of factors contributing to prognosis. To provide a comparable and more comprehensive pre-treatment prognostic model, the International Prognostic Index (IPI) was introduced as a standardised model for DLBCL and other aggressive NHLs (Shipp et al. 1993), combining 5 factors: age, Ann Arbor stage classification, ECOG Performance Status, serum lactate dehydrogenase (LDH) level and the number of extranodal sites of disease. By pooling these factors, the IPI outperformed simple disease stage classification models such as Ann Arbor. Zhou et al. (2014) went on to refine the IPI, creating the National Comprehensive Cancer Network International Prognostic Index (NCCN-IPI). A key innovation of the NCCN-IPI was the use of cubic splines to model the continuous factors of age and LDH level as higher-resolution categorical factors, changing them from binary factors to quaternary and ternary factors respectively. As prognostic models improved, so did treatment. The first-generation chemotherapy regimen of cyclophosphamide, doxorubicin, vincristine and prednisone (CHOP) had a cure rate of 30-35% (Fisher et al. 1993). With the addition of rituximab to the regimen (R-CHOP) and other advances, DLBCL became curable for more than 60% of patients (Johnson et al. 2012).
While age is incorporated into the IPI (and subsequent NCCN-IPI), other demographic factors such as gender are not, despite established evidence for their prognostic involvement. Males demonstrate higher hazard ratios (Carella et al. 2013) and gender-associated pharmacokinetic differences in rituximab lead to worse treatment responses in males (Habermann 2014). These demographic effects raise questions about whether risk factors perform uniformly across patient subgroups. Beyond demographic and clinical factors, a series of genetic alterations have emerged as strong prognostic markers: BCL2 (Gascoyne et al. 1997), BCL6 (Lo Coco et al. 1994) and MYC (Chenevix-Trench et al. 1986). These changes are observed at both the genotypic and phenotypic levels, as translocation mutations and protein overexpression respectively (Petrich et al. 2014). Furthermore, patients with varying combinations of these alterations have shown worse responses to R-CHOP treatments, highlighting the clinical relevance of genomic profiling. Some of these genetic markers, such as BCL6 (Klapper et al. 2012) and MYC (Kurz et al. 2025), have shown association with patient age, adding another layer of interaction that could influence prognosis. Klapper et al. (2012) further demonstrated that IRF4 translocations are associated with age, and along with BCL6 and other genetic drivers, collectively lose prognostic significance when age is incorporated in multivariate models. This indicates that demographic factors may not merely add to genomic risk additively, but could modify how these genomic drivers influence clinical outcomes.
With the advent of machine learning methods came attempts to construct models based on clinical and genomic data. Reddy et al. (2017a) performed whole exome and transcriptome sequencing in 1001 DLBCL patients, identifying 150 genetic drivers including BCL2, BCL6, and MYC alongside numerous newly characterised candidates. By combining genotypic alterations, gene expression data and cell-of-origin classification, they developed a genomic risk model that outperformed the IPI. However, the potential for demographic factors to modify these genomic associations remains understudied. Given that age and gender both demonstrate prognostic significance and influence treatment response, it remains unknown whether genomic risk markers perform uniformly across demographic subgroups. If these demographic factors modify how gene expression levels impact treatment response, such that the same gene expression levels in different gender or age groups lead to different prognostic outcomes, it would be crucial that these differences are identified and applied in future models to prevent inaccurate prognoses.
This study addresses this gap by investigating whether age and gender interact with and modify the prognostic associations between gene expression markers and therapy response in DLBCL. The associations of the demographic factors (age and gender) with complete therapy response and with the expression levels of 21 genes were considered first. Logistic regression models were then fitted to predict therapy response from the expression levels of the 21 genes, and the interaction of demographic factors with these expression levels was examined. Exploring how demographic factors interact with known and potential genomic drivers of DLBCL could reveal age- or gender-dependent tumour biology that informs therapeutic approaches.
Here the same datasets (Reddy et al. 2017b) from the genomic risk model study by Reddy et al. (2017a) are used, specifically the clinical information dataset from Supplementary Table S1 and the gene expression dataset from Supplementary Table S2. The clinical information dataset comprises 1001 DLBCL patients with 35 variables. Patients are recorded with anonymised IDs, and the variables of interest here are gender, response to initial therapy, age at diagnosis, and the expression levels of the MYC, BCL2 and BCL6 genes, given as log2-transformed RNA-sequencing FPKM (fragments per kilobase of transcript per million mapped reads) values. These columns were selected for further analysis not only for their relevance but also for their relative completeness. The gene expression dataset comprises the expression levels (also log2 transformed) of 19 genes for 775 patients recorded using the same anonymised IDs; one of these genes, BCL6, is also recorded in the clinical information dataset.
The datasets are available via Elsevier’s open access license, as shown on the ScienceDirect webpage of the publication by Reddy et al. (2017a). For their work, the authors obtained anonymised patient clinical information and tumours, which were processed in line with a protocol approved by the Institutional Review Board at Duke University. The 2 datasets, Table S1 and Table S2, were downloaded as Excel files directly from their links on the aforementioned ScienceDirect page using the download.file function from R’s utils package (R Core Team 2025) and read via the readxl package (Wickham and Bryan 2025). The specific sheets were selected, cleaned and exported as .csv files; this pre-processing is documented in the 01-data-preprocessing.R script. The processed .csv files were then uploaded to this project’s GitHub repo as mmc1-ClinicalInformation.csv and mmc2-GeneExpression.csv.

This RMarkdown document contains only the code for the analyses and plots shown here. For the complete analysis, including assumption testing, post-hoc analysis and tests with non-significant results, please refer to the supplementary-materials folder of this repository. This document also supports sourcing the datasets from 3 sources for redundancy: the original ScienceDirect links, the processed .csv files on GitHub, and the local processed .csv files when this repository is downloaded. For this knitted HTML document, both datasets were loaded directly from ScienceDirect’s supplementary materials: S1 Table (clinical data) and S2 Table (gene expression data).
The 2 datasets were then joined on their common patient IDs, and 2 new columns were added. The first, age_group_nccn, was obtained by transforming the age at diagnosis variable into 4 categorical groups according to the cutoffs defined in the NCCN-IPI of Zhou et al. (2014): ≤40, 41-60, 61-75 and >75. The second, complete_response, was derived from the response_therapy column by re-classifying “Partial response” and “No response” as “Incomplete response” while keeping “Complete response” unchanged. This prepared the dataset for binomial logistic regression. As both datasets contained gene expression levels for BCL6, the two columns were compared to verify that they were identical and 1 was discarded. The resulting table had 775 patients and 27 variables, including the anonymised patient ID; demographic variables such as gender (binary: F/M), age (continuous, at diagnosis) and age group (quaternary, as defined by NCCN-IPI); clinical information of therapy response (ternary: Complete response, Partial response, No response); and the log2 transformed gene expression levels of 21 genes: MYC, BCL2, BCL6, ITPKB, MME, MYBL1, DENND3, NEK6, LMO2, LRMP, SH3BP5, IRF4, PIM1, ENTPD1, BLNK, CCND2, ETV6, FUT8, BMF, IL16 and PTPN1.
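The two derived columns described above can be sketched as follows. This is a minimal illustration in Python rather than the R used in the project scripts. The function names are hypothetical, and treating each cutoff as an inclusive upper bound (so that an age of 40.5 would fall in the 41-60 group) is an assumption, since the source does not state how non-integer ages are handled.

```python
from typing import Optional


def nccn_age_group(age: Optional[float]) -> Optional[str]:
    """Map age at diagnosis to the four NCCN-IPI age groups.

    Assumption: cutoffs are inclusive upper bounds; missing ages stay missing.
    """
    if age is None:
        return None
    if age <= 40:
        return "≤40"
    if age <= 60:
        return "41-60"
    if age <= 75:
        return "61-75"
    return ">75"


def recode_response(response: Optional[str]) -> Optional[str]:
    """Collapse the ternary therapy response into a binary outcome."""
    if response is None:
        return None
    if response == "Complete response":
        return "Complete response"
    if response in ("Partial response", "No response"):
        return "Incomplete response"
    raise ValueError(f"Unexpected response label: {response!r}")


# Hypothetical example rows (not taken from the study data).
example_patients = [
    {"age": 38, "response_therapy": "Complete response"},
    {"age": 67, "response_therapy": "Partial response"},
    {"age": None, "response_therapy": "No response"},
]

for patient in example_patients:
    patient["age_group_nccn"] = nccn_age_group(patient["age"])
    patient["complete_response"] = recode_response(patient["response_therapy"])
    print(patient)
```

Keeping missing ages and responses as missing, rather than assigning them to a group, matches the fact that some columns in the joined table are less than 100% complete.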
Gene expression data were limited to the 21 genes available from the datasets of Reddy et al. (2017b) (Supplementary Tables S1 & S2), which fall into 2 categories. The first category is the “big 3” oncogenic drivers MYC, BCL2 and BCL6, which serve as strong prognostic markers. Moreover, MYC (Kurz et al. 2025) and BCL6 (Klapper et al. 2012) have both shown age-dependent prognostic effects, suggesting age may modify how these markers influence clinical outcomes. The second category comprises 19 cell-of-origin classifier genes (DLBCL subtypes: ABC vs GCB) (Wright et al. 2003), with BCL6 appearing in both categories. Although these genes were initially used for classification, members such as CCND2 and LMO2 (Lossos et al. 2004) have been shown to be strong survival predictors alongside BCL2 and BCL6. More importantly, IRF4 is among the genetic features that show age associations and, alongside BCL6, contributes to the genetic complexity that loses prognostic significance when age is incorporated (Klapper et al. 2012). Both categories therefore contain genes with the potential to interact with demographic factors, making them suitable for testing whether demographic factors modify their prognostic associations.
Overall, data completeness for these 27 columns was high, with 23 columns being 100% complete and the remaining 5 columns all more than 93% complete. All of this processing is performed in the 02-data-processing.R script. Characteristics of the cohort are summarised in Table 1.
Table 1. Baseline Cohort Characteristics
| Characteristic | Total (N=775) | ≤40 years | 41-60 years | 61-75 years | >75 years |
|---|---|---|---|---|---|
| Sample size | N = 775 | n = 75 | n = 249 | n = 285 | n = 135 |
| Gender | | | | | |
| Female | 338 (43.6%) | 20 (26.7%) | 91 (36.5%) | 129 (45.3%) | 77 (57%) |
| Male | 437 (56.4%) | 55 (73.3%) | 158 (63.5%) | 156 (54.7%) | 58 (43%) |
| Age at diagnosis, years | 61 ± 15.4 | 29.5 ± 8 | 52.1 ± 5.2 | 67.7 ± 4.4 | 80.4 ± 4 |
| Treatment response | | | | | |
| Complete response | 598 (82.6%) | 67 (89.3%) | 202 (83.1%) | 218 (82.3%) | 94 (77%) |
| Incomplete response | 126 (17.4%) | 8 (10.7%) | 41 (16.9%) | 47 (17.7%) | 28 (23%) |
Demographic and clinical characteristics stratified by NCCN-IPI age
groups.
Values shown as n (%) for categorical variables and mean ±
SD for continuous variables.
NCCN-IPI age groups represent National
Comprehensive Cancer Network International Prognostic Index
classifications.
Treatment response categories combine partial and
no response as “Incomplete response” versus complete response.
Percentages are calculated within each age group for gender
distributions and treatment responses.
First, variation in the expression levels of the 21 genes across gender and age groups was assessed using Student’s t-test and one-way Welch ANOVA respectively, with false discovery rate (FDR) adjustment of the p-values (Reiner et al. 2003) to account for the number of genes analysed (see 04-data-analysis.R). Post-hoc Games-Howell tests were then conducted for significant ANOVA results. Assumptions for these tests were checked with diagnostic tests and plots, and the overall distribution of gene expression against age and gender was examined visually (see 03-data-exploration.R). The normality assumption was considered satisfied by virtue of the large sample size (\(n = 775\)) and high data completeness, since with samples of this size the sampling distribution of the group means is approximately normal (central limit theorem). Second, the association of complete therapy response with gender and with age group was assessed using Chi-squared tests, with expected frequencies checked to be \(> 5\) in 05-response-analysis.R.

Finally, logistic regression models were fitted to predict complete response and to test how age or gender groupings affect the predictions. Null models (response ~ 1) were fitted as the baseline, followed by gene-expression-only models (response ~ gene expression), main-effect models (response ~ gene expression + age/gender) and interaction models (response ~ gene expression × age/gender). The effects of age group and gender were investigated independently. The fitted models were compared using the likelihood-ratio test (LRT), and the resulting p-values were FDR-corrected. Significant models were then checked for goodness-of-fit with Chi-square tests and Akaike Information Criterion (AIC) scores, along with influential observations and collinearity (Hodgson et al. 2025), all recorded in 06-effect-analysis.R.
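As an illustration of the Chi-squared test of independence described above, the following sketch recomputes the test for complete response across age groups from the counts in Table 1. It is written in Python rather than the project's R and uses only the standard library. It assumes the test was run on the 705 patients with both an age group and a therapy response recorded, and the helper names are hypothetical. The closed-form tail probability used here is valid only for 3 degrees of freedom.

```python
import math

# Counts from Table 1, ordered by age group: ≤40, 41-60, 61-75, >75.
observed = [
    [67, 202, 218, 94],  # complete response
    [8, 41, 47, 28],     # incomplete response
]


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Pearson chi-squared statistic summed over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


smallest_expected = min(min(row) for row in expected_counts(observed))
statistic = chi_square_statistic(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = chi2_sf_df3(statistic)

print(f"smallest expected count: {smallest_expected:.2f}")
print(f"chi-squared = {statistic:.3f}, df = {degrees_of_freedom}, p = {p_value:.3f}")
```

Under these assumptions the smallest expected count is about 13, comfortably above 5, and the statistic of roughly 4.99 on 3 degrees of freedom gives a p-value close to the 0.173 reported for age groups in this study.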
Gene expression levels showed no significant differences between genders after t-tests with FDR correction. Across age groups, 10 genes showed significant differences after Welch ANOVA with FDR correction: BCL2 (p = 0.0002), BLNK (p = 0.003), MYBL1 (p = 0.0096), SH3BP5 (p = 0.0156), ITPKB (p = 0.0203), CCND2 (p = 0.0203), PTPN1 (p = 0.0314), BMF (p = 0.0339), FUT8 (p = 0.0367), LMO2 (p = 0.0377). Post-hoc Games-Howell tests showed that, for most of these genes, the age groups separated into 2 statistically distinct groupings, with BLNK separating into 3. The expression levels of these 10 genes across the 4 age groups, and the groupings of similar age groups, are shown in Figure 1.
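The FDR adjustment applied to these p-values can be sketched as below, assuming the Benjamini-Hochberg step-up procedure (the source cites Reiner et al. 2003 but does not name the exact variant). The code is Python rather than the project's R, the function name is hypothetical, and the raw p-values are made-up illustrative numbers, not values from this study.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four tests.
raw_p_values = [0.01, 0.04, 0.03, 0.005]
adjusted_p_values = bh_adjust(raw_p_values)

for raw, adj in zip(raw_p_values, adjusted_p_values):
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.3f}")
```

Because each adjusted value is capped by the adjusted value of the next-larger raw p-value, neighbouring tests can end up sharing the same adjusted p-value. This would be consistent with the identical adjusted values reported for ITPKB and CCND2 (both 0.0203).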
Figure 1. Gene Expression Patterns Across Age Groups.
Box plots show expression distribution by
age group for genes with significant ANOVA results (FDR-adjusted p <
0.05).
Compact letter display (CLD) letters (a, b, c) above each
boxplot indicate statistical groupings from Games-Howell post-hoc tests;
groups with different letters differ significantly (p < 0.05).
Sample sizes (n=X) are shown below each age group.
Individual data
points are overlaid with outliers highlighted in darker borders.
Therapy response did not differ significantly across gender or age groups. The complete response distribution was almost uniform across genders (p > 0.999). The distribution across age groups showed some variation, with the proportion of incomplete responses rising with age, but this was not statistically significant (p = 0.173). To examine the relation between gene expression levels and treatment response, logistic regression analysis with gender or age group interaction was performed for the 21 genes. After fitting null, gene effect, main effect and interaction effect models for all 21 genes, 2 genes of interest emerged with statistical significance. First, MYC showed a significant gene effect (gene model LRT FDR-adjusted p-value = 0.009954), with an acceptable goodness-of-fit (chi-square p-value = 0.9467) and the best (lowest) AIC score (647.57) of the 4 models for MYC, 10.22 lower than the null model. Second, CCND2 showed a significant age interaction effect (age-interaction model LRT FDR-adjusted p-value = 0.005523), again with a good goodness-of-fit (chi-square p-value = 0.9726) and the best (lowest) AIC score (643.12) of the 4 models for CCND2, more than 12 points lower than each of the other models. No other gene showed a significant gene, main or interaction effect for either age or gender. The odds ratios for the gene effects of all 21 genes are shown in Figure 2.
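The AIC differences and LRT results quoted above are linked. Since AIC = 2k - 2 ln L, for nested models fitted to the same observations the LRT statistic equals the drop in AIC plus twice the number of added parameters. The Python sketch below (not the project's R code; the function names are hypothetical) applies this to MYC, assuming the gene-only model adds exactly one parameter to the null model.

```python
import math


def lrt_statistic_from_aic_drop(aic_drop, extra_parameters):
    """LRT statistic for nested models: 2*(lnL1 - lnL0) = AIC drop + 2*extra params."""
    return aic_drop + 2 * extra_parameters


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


# Reported in the text: the MYC gene model's AIC is 10.22 below the null model's.
myc_aic_drop = 10.22
# Assumption: one added coefficient (the MYC expression slope).
myc_lrt_statistic = lrt_statistic_from_aic_drop(myc_aic_drop, extra_parameters=1)
myc_unadjusted_p = chi2_sf_df1(myc_lrt_statistic)

print(f"LRT statistic = {myc_lrt_statistic:.2f}")
print(f"unadjusted p-value = {myc_unadjusted_p:.6f}")
```

With the reported AIC drop of 10.22, this gives an LRT statistic of 12.22 and an unadjusted p-value of roughly 0.0005. That raw value is not reported in the text, so it should be read as a consistency check rather than a result.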
Looking further at the logistic regression plot of the CCND2 age-interaction model (Figure 3), it becomes clear that CCND2 expression affects the probability of complete response differently across age groups. The curves for age groups ≤40 and 41-60 run in the opposite direction to the curves for age groups 61-75 and >75: complete response probability decreases with CCND2 expression in the 2 younger groups and increases with it in the 2 older groups. This contrasts sharply with the logistic regression plot for the MYC age-interaction model, where the curves for all 4 age groups run in the same direction, with complete response probability decreasing as MYC expression increases. Examining the odds ratios of the 4 age groups in the CCND2 age-interaction model (Figure 4), both older age groups (61-75, >75) showed significantly higher odds of complete response, whereas the younger age groups (≤40, 41-60) trended towards lower odds of complete response without reaching significance.
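The opposite directions of the curves follow from the form of the interaction model. As a sketch, assume treatment (dummy) coding of age group with one reference group \(g_0\) (the source does not state the coding or the reference level), and write \(x\) for the log2 CCND2 expression level and \(p\) for the probability of complete response:

\[
\operatorname{logit}(p) = \log\frac{p}{1-p} = \beta_0 + \beta_1 x + \sum_{g \neq g_0} \alpha_g \, \mathbb{1}[\text{age group} = g] + \sum_{g \neq g_0} \gamma_g \, x \, \mathbb{1}[\text{age group} = g]
\]

The slope of the log-odds with respect to expression is then \(\beta_1\) in the reference group and \(\beta_1 + \gamma_g\) in group \(g\), so the curves of two age groups run in opposite directions whenever their slopes differ in sign. The main-effect model is the special case with all \(\gamma_g = 0\), which forces the same slope on every age group and only shifts the curves by \(\alpha_g\).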
Figure 2. Logistic Regression Models - Gene Expression Effects on Complete Response
Orange: Adjusted p-values < 0.05
Teal: Adjusted p-values ≥ 0.05, unadjusted p-values < 0.05
Light Purple: Unadjusted p-values ≥ 0.05 (insignificant)
Confidence interval: 95%
MYC is the only gene showing a significant gene effect after FDR correction; even allowing for its 95% confidence interval error bars, it lies well clear of the 1.0 reference (no effect) line.
Figure 3. Logistic Regression Models - Gene Expression Effects on Complete Response
B. Logistic regression plot of CCND2 gene expression against complete response probability, age-interaction model